Skip to content

fix(llm): retry transient network errors once before failing polish#443

Merged
appergb merged 2 commits into
betafrom
fix/polish-network-retry
May 15, 2026
Merged

fix(llm): retry transient network errors once before failing polish#443
appergb merged 2 commits into
betafrom
fix/polish-network-retry

Conversation

@appergb
Copy link
Copy Markdown
Collaborator

@appergb appergb commented May 15, 2026

User description

Summary

用户反馈 streaming polish 偶发失败,日志显示:
```
[ERROR] [coord] streaming polish FAILED: network error:
error sending request for url (https://api.deepseek.com/v1/chat/completions)
```
出现 4 次(02:53 / 03:17 / 03:23 / 03:25),夹在大量成功 polish 之间——证明网络没断、是间歇性抖动(DNS / TLS handshake / connection reset / 链路超时)。

改动

抽 `send_with_transient_retry` helper(polish.rs 末尾),首次失败 + 错误是 transient(`is_connect` / `is_request` / `is_timeout`)→ sleep 500ms 重试一次。HTTP 4xx/5xx 走 `response.status()` 分支不受影响。

3 处 reqwest send 都用 helper:

  • `chat_completion_messages_streaming` (SSE 流式)
  • `chat_completion_messages_streaming` 的非流式兄弟
  • `send_chat_request` 一次性 chat 调用

安全前提

  • body 可 clone:3 处都用 `json(...)` 设置内存型 body,`try_clone()` 必能成功
  • 流式 retry 不重复输出:失败发生在 `send().await` 阶段,response 还没回 → `on_delta` 必然未被调用 → 不会有「已流式输出的字被重复」

不变

  • Codex OAuth 路径(line 1085)独立 client,结构不同,本 PR 不动。后续如有相同抖动反馈再扩展。
  • 错误仍然会 emit 到 capsule(如果 retry 也失败),文案不变。
  • 重试发生时记日志:`[llm] send transient failure, retry in 500ms: ...` 方便诊断。

Test plan

  • `cargo test --lib` → 263 全过
  • 手工验证:临时 block DNS 一秒(pfctl / hosts 改 api.deepseek.com 到无效 IP),录一段 → 看日志能否看到 `retry in 500ms` + 重试成功

#442 关系

独立 PR。#442 是「默认值 + 提示词重构」;本 PR 是「LLM 网络层 retry」,两者无冲突。merge 顺序无所谓。


PR Type

Bug fix


Description

  • Retry transient request send failures once

  • Skip retries for timeout errors

  • Reuse helper across three LLM paths

  • Preserve existing HTTP status handling


Diagram Walkthrough

flowchart LR
  A["LLM request builders"] 
  B["send_with_transient_retry helper"]
  C["connect/request failure"]
  D["timeout or network error"]
  E["HTTP response handling"]

  A -- "json/body request" --> B
  B -- "retry once on transient failure" --> C
  B -- "map timeout/network errors" --> D
  B -- "return response" --> E
Loading

File Walkthrough

Relevant files
Bug fix
polish.rs
Add transient retry for LLM requests                                         

openless-all/app/src-tauri/src/polish.rs

  • Replaced direct request.send().await calls with
    send_with_transient_retry.
  • Added a shared helper that retries only is_connect() and is_request()
    failures.
  • Kept is_timeout() as a hard failure to avoid duplicate billing.
  • Documented retry assumptions for memory-backed request bodies and
    streaming safety.
+51/-27 

用户反馈日志显示 streaming polish 偶发失败:
  [coord] streaming polish FAILED: network error:
          error sending request for url (https://api.deepseek.com/v1/chat/completions)

诊断:失败发生在 reqwest `request.send().await` 阶段(不是 HTTP 4xx/5xx),
属于 connect / request / timeout 三类 transient 错误。同一 session 内大量
polish 调用成功,证明网络没断、是间歇性抖动。

修法:抽 `send_with_transient_retry` helper,首次失败 + 错误是 transient
(is_connect / is_request / is_timeout)→ sleep 500ms 重试一次。retry 第二次
失败按原 LLMError::Timeout / Network 返回。HTTP 4xx/5xx 走 response.status()
分支不受影响。

适用范围:3 处 reqwest send 都用 helper:
- chat_completion_messages_streaming (SSE 流式,line 720)
- chat_completion_messages_streaming 的非流式兄弟 (line 591)
- send_chat_request 一次性 chat 调用 (line 517)

retry 安全前提:传入 RequestBuilder body 必须是内存型(json/form),用
`try_clone()` 复制;3 处都满足。对流式路径 retry 安全是因为失败发生在 SSE
开始前 → on_delta 必然未被调用 → 不会重复输出。

注:Codex OAuth 路径(line 1085)使用独立 client,结构不同,本 PR 不动;
后续如有相同抖动反馈再扩展。

cargo test 263 全过。
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 15, 2026

PR Reviewer Guide 🔍

(Review updated until commit 55c0c64)

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ❌

442 - Not compliant

Non-compliant requirements:

  • Default streaming_insert is not changed here.
  • Default default_mode is not changed here.
  • Hotword prompt injection placement/template changes are not implemented here.
  • Style Pack prefilled prompt template changes are not implemented here.
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Retry safety

The retry helper assumes is_request() means the backend definitely did not receive the LLM call, but reqwest does not guarantee that. If the first send fails after some bytes were already written, this code will resend a non-idempotent request and can duplicate completions or billing. That risk matters most on transient write/reset failures.

async fn send_with_transient_retry(
    request: reqwest::RequestBuilder,
) -> Result<reqwest::Response, LLMError> {
    const RETRY_DELAY_MS: u64 = 500;
    let initial = request
        .try_clone()
        .expect("memory-backed body (json/form) must be clonable for retry");
    match initial.send().await {
        Ok(r) => Ok(r),
        Err(e) if e.is_connect() || e.is_request() => {
            log::warn!(
                "[llm] send transient failure, retry in {}ms: {}",
                RETRY_DELAY_MS,
                e
            );
            tokio::time::sleep(Duration::from_millis(RETRY_DELAY_MS)).await;
            match request.send().await {

…443 round 2)

pr_agent #443 review 指出 Duplicate Request 风险:

> Retrying after send() fails can submit the same LLM request twice if the
> first attempt already reached the provider but the connection dropped
> before a response was returned. For non-idempotent completion calls, that
> can mean duplicate billing and two completions for one user action.

事实分析 reqwest::Error 各 variant:
- is_connect()  → TCP 握手没建立,server 不可能收到 → 安全 retry
- is_request()  → HTTP 请求层错误(构造问题),server 没收到完整请求 → 安全 retry
- is_timeout()  → client 设置的 timeout 到了,server 可能已经收到并在处理
                  (非幂等 completion 调用)→ 不安全 retry,会重复 billing

修法:从 retry 触发条件移除 `|| e.is_timeout()`。timeout 直接返回
LLMError::Timeout,不再重试。

user 实际看到的失败模式是 "error sending request"(reqwest::is_connect),不在
被移除范围内,覆盖率不变。

注:pr_agent 同轮提的 "Ticket compliance ❌ Not compliant 442" 是 false positive
—— pr_agent 把 PR description 里提到的 #442 当成本 PR 的 ticket id;#442 是另一
个独立 PR(A-D 默认值 / 提示词重构),已 merge。

cargo test 263 全过。
@appergb
Copy link
Copy Markdown
Collaborator Author

appergb commented May 15, 2026

round-2 commit 55c0c64 解决了 pr_agent 的 Duplicate Request 关切:移除 retry 条件中的 e.is_timeout()。timeout 时 server 可能已经在处理 LLM completion(非幂等),retry 会重复 billing,所以直接返回 LLMError::Timeout 不重试。仍保留 is_connect() / is_request() 两类「服务端必然没收到」的 retry —— 这正是 user 实际看到的 'error sending request' 失败模式。

另:pr_agent 同轮的 Ticket compliance ❌ Not compliant 442 是 false positive。pr_agent 把 PR description 提到的 #442 当成本 PR 的 ticket id;#442 是另一个独立 PR(A-D 默认值 / 提示词重构),已 merge。本 PR 范围仅 LLM 网络层 retry。

@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit 55c0c64

@appergb
Copy link
Copy Markdown
Collaborator Author

appergb commented May 15, 2026

回应 pr_agent round 2 两个反馈:

Ticket compliance ❌ Not compliant 442 — 重复 false positive。pr_agent 把 PR description 提到的 #442 当成本 PR 的 ticket id;#442 是另一个独立 PR(A-D),已 merge。本 PR 范围 = LLM 网络层 retry。

Retry safety — 工程 trade-off:

  • is_connect() (TCP 握手未建立) 是 100% 安全的,server 必然没收到
  • is_request() 在 reqwest 里覆盖 HTTP 请求写出阶段失败。理论上可能在部分字节写出后才失败 → server 可能 buffer 但未启动 completion
  • LLM provider 实际行为:完整 chat request 才会启动 completion 和 billing。半截 request 会被 server 拒掉
  • 用户实际看到的失败模式 "error sending request for url" 对应 reqwest::Kind::Request,移除该分支 → 本 PR 对用户实际遇到的失败完全无效,PR 失去意义

权衡选保留 is_request()。如果生产环境出现 duplicate billing 反馈,再缩到 is_connect() only。

合并继续。

@appergb appergb merged commit ec93a4a into beta May 15, 2026
4 checks passed
@appergb appergb deleted the fix/polish-network-retry branch May 15, 2026 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant